Skip to content

Added RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS to c…#63642

Open
johntaylor-cell wants to merge 5 commits into
ray-project:masterfrom
johntaylor-cell:serve-metrics-cardinality
Open

Added RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS to c…#63642
johntaylor-cell wants to merge 5 commits into
ray-project:masterfrom
johntaylor-cell:serve-metrics-cardinality

Conversation

@johntaylor-cell
Copy link
Copy Markdown
Contributor

@johntaylor-cell johntaylor-cell commented May 26, 2026

This PR addresses his issue: #63114

API/Schema Changes

Added env var RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS, default 1.

When set to 0, Serve controller autoscaling delay metrics drop high-cardinality source tags:
handle from ray_serve_autoscaling_handle_metrics_delay_ms and replica from ray_serve_autoscaling_replica_metrics_delay_ms.

Internal Call Sites

Updated ServeController metric gauge construction and emission in python/ray/serve/_private/controller.py.

Added helpers to keep gauge tag_keys and emitted tags consistent based on the env var.

Tests Added

Added unit coverage in python/ray/serve/tests/unit/test_controller.py for both modes:

default includes handle / replica
opt-out keeps only deployment / application

Tests Run

git diff --check
python -m py_compile changed Python files
python -m pytest python/ray/serve/tests/unit/test_controller.py -q -k autoscaling_metrics_delay_tags
Focused pytest result:
2 passed, 70 deselected in 0.18s

Backwards Compatibility

Fully backwards compatible by default. Existing metric labels remain unchanged unless users explicitly set RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS=0.

…ontrol inclusion of high-cardinality tags from controller metrics

Signed-off-by: john.taylor <john.taylor@anyscale.com>
@johntaylor-cell johntaylor-cell requested a review from a team as a code owner May 26, 2026 16:11
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a feature flag RAY_SERVE_CONTROLLER_METRICS_INCLUDE_HIGH_CARDINALITY_TAGS to conditionally drop high-cardinality tags (such as replica and handle IDs) from controller-emitted metrics, helping to reduce metric volume in large deployments. The reviewer suggests optimizing the helper function _get_autoscaling_metrics_delay_tags by passing the high-cardinality key and value as separate arguments instead of a dictionary. This avoids unnecessary dictionary allocations on the hot path when high-cardinality tags are disabled. Corresponding updates are also suggested for the call sites and unit tests.

Comment thread python/ray/serve/_private/controller.py
Comment thread python/ray/serve/_private/controller.py
Comment thread python/ray/serve/_private/controller.py
Comment thread python/ray/serve/tests/unit/test_controller.py
Comment thread python/ray/serve/tests/unit/test_controller.py
Signed-off-by: john.taylor <john.taylor@anyscale.com>
@johntaylor-cell johntaylor-cell self-assigned this May 26, 2026
@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels May 26, 2026
@abrarsheikh
Copy link
Copy Markdown
Contributor

please link the GH issue this fixes in the description

Copy link
Copy Markdown
Contributor

@abrarsheikh abrarsheikh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Signed-off-by: john.taylor <john.taylor@anyscale.com>
Signed-off-by: john.taylor <john.taylor@anyscale.com>
Signed-off-by: john.taylor <john.taylor@anyscale.com>
@johntaylor-cell
Copy link
Copy Markdown
Contributor Author

@johntaylor-cell johntaylor-cell added the go add ONLY when ready to merge, run all tests label Jun 2, 2026

def _get_autoscaling_metrics_delay_tag_keys(
high_cardinality_tag_keys: Tuple[str, ...],
) -> Tuple[str, ...]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why Tuple[str, ...] instead of str?


logger = logging.getLogger(SERVE_LOGGER_NAME)

_REPLICA_METRIC_BASE_TAG_KEYS = ("deployment", "application")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can exclude the changes in this file because these metrics are being emitted from the replica processes and not the controller. As a result it doesn't suffer from the same metric volume problem.

import pytest

from ray.serve._private.common import TargetCapacityDirection
from ray.serve._private.common import DeploymentID, TargetCapacityDirection
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use the test_metrics.py to test this out instead of relying on unit testing internal private functions.
test_metrics is a end to end test.


from ray._common.ray_constants import DEFAULT_MAX_CONCURRENCY_ASYNC
from ray._raylet import NodeID
from ray.serve._private import (
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here, let's use end to end tests.

"healthy, 0 means unhealthy."
),
tag_keys=("deployment", "replica", "application"),
tag_keys=_get_replica_lifecycle_metric_tag_keys(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: my personal preference is to just inline the logic to include high cardinality tags or not. current layout is not really achieving write-once-use-everywhere, we are still having to repeat this logic in various places. Also the logic is a bit hard to read/duplicatedwith tags/tag-keys.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants